The Central Limit Theorem and Related Topics
The Central Limit Theorem
If measurements are obtained independently and come from a process with finite variance, then the distribution of their mean tends towards a Gaussian (normal) distribution as the sample size increases.
Figure: The standard normal density
Details
The Central Limit Theorem states that if $X_1, X_2, \ldots$ are independent and identically distributed random variables with mean $\mu$ and (finite) variance $\sigma^2$, then the distribution of $\bar{X}_n = \frac{1}{n}(X_1 + \cdots + X_n)$ tends towards a normal distribution. It follows that for a large enough sample size $n$, the distribution of the random variable $\bar{X}_n$ can be approximated by $N(\mu, \sigma^2/n)$.
The standard normal distribution is given by the p.d.f.:

$$\varphi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$$

for $z \in \mathbb{R}$.
The standard normal distribution has an expected value of zero,

$$E[Z] = \int_{-\infty}^{\infty} z\,\varphi(z)\,dz = 0,$$

and a variance of

$$Var[Z] = \int_{-\infty}^{\infty} z^2\,\varphi(z)\,dz = 1.$$
If a random variable $Z$ has the standard normal (or Gaussian) distribution, we write $Z \sim N(0,1)$.
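As a quick numerical check (a sketch using base R's integrate and dnorm, not part of the derivation), both integrals can be evaluated directly:

# Numerically verify E[Z] = 0 and Var[Z] = 1 for Z ~ N(0,1)
integrate(function(z) z * dnorm(z), -Inf, Inf)    # approximately 0
integrate(function(z) z^2 * dnorm(z), -Inf, Inf)  # approximately 1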
If we define a new random variable, $X$, by writing $X = \mu + \sigma Z$, then $X$ has an expected value of $\mu$, a variance of $\sigma^2$ and a density (p.d.f.) given by the formula:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

This is the general normal (or Gaussian) density, with mean $\mu$ and variance $\sigma^2$.
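One way to confirm that this formula matches R's built-in density is to compare it with dnorm directly; a minimal sketch, with arbitrary illustrative values mu = 2 and sigma = 3:

mu <- 2
sigma <- 3
x <- seq(-5, 10, by = 0.5)
# the density formula written out by hand
manual <- exp(-(x - mu)^2 / (2 * sigma^2)) / (sqrt(2 * pi) * sigma)
all.equal(manual, dnorm(x, mean = mu, sd = sigma))  # TRUE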
The Central Limit Theorem states that if you take the mean of several independent and identically distributed random variables, the distribution of that mean will look more and more like a Gaussian distribution as the number of variables increases (provided the variance of the original random variables is finite).
More precisely, the cumulative distribution function of:

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}$$

converges to $\Phi$, the $N(0,1)$ cumulative distribution function.
Examples
If we collect measurements on waiting times, these are typically assumed to come from an exponential distribution with density

$$f(x) = \lambda e^{-\lambda x}, \quad x > 0.$$
The Central Limit Theorem states that the mean of several such waiting times will tend to have a normal distribution.
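A minimal simulation sketch of this (the sample size n = 10, rate lambda = 1 and the number of replications are arbitrary choices): the means of repeated exponential samples pile up in a roughly bell-shaped histogram around the true mean 1/lambda = 1.

# Means of 10000 exponential samples, each of size n
n <- 10
lambda <- 1
means <- replicate(10000, mean(rexp(n, lambda)))
hist(means)  # roughly bell-shaped around 1/lambda = 1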
We are often interested in computing:

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

which comes from a $t$-distribution with $n-1$ degrees of freedom (see below), if the $X_i$ are independent outcomes from a normal distribution. However, if $n$ is large and $\sigma^2$ is finite then the $t$-values will look as though they came from a normal distribution. This is in part a consequence of the Central Limit Theorem, but also of the fact that $s$ will become close to $\sigma$ as $n$ increases.
Properties of the Binomial and Poisson Distributions
The binomial distribution is really a sum of $0$ and $1$ values (counts of failures and successes). So a single binomial outcome, being such a sum, will look as though it comes from a normal distribution if the number of trials is large enough.
Details
Consider the binomial probabilities:

$$p(x) = P[X = x] = \binom{n}{x} p^x (1-p)^{n-x}$$

for $x = 0, 1, 2, \ldots, n$, where $n$ is a non-negative integer. Suppose $p$ is a small positive number; specifically, consider a sequence of decreasing $p$-values given by $p_n = \lambda/n$, and consider the behavior of the probability as $n \to \infty$. We obtain:

$$\binom{n}{x} p_n^x (1-p_n)^{n-x} = \frac{n!}{x!\,(n-x)!}\left(\frac{\lambda}{n}\right)^x \left(1-\frac{\lambda}{n}\right)^{n-x} = \frac{n(n-1)\cdots(n-x+1)}{n^x}\,\frac{\lambda^x}{x!}\,\frac{\left(1-\frac{\lambda}{n}\right)^n}{\left(1-\frac{\lambda}{n}\right)^x}$$

Notice that $\frac{n(n-1)\cdots(n-x+1)}{n^x} \to 1$ as $n \to \infty$. Also notice that $\left(1-\frac{\lambda}{n}\right)^x \to 1$ as $n \to \infty$. Also

$$\left(1-\frac{\lambda}{n}\right)^n \to e^{-\lambda} \quad \text{as } n \to \infty$$

and it follows that

$$P[X = x] \to \frac{\lambda^x}{x!}e^{-\lambda} \quad \text{as } n \to \infty,$$

which is the Poisson probability, and hence the binomial probabilities may be approximated with the corresponding Poisson probabilities.
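This approximation is easy to inspect numerically in R; a minimal sketch, assuming arbitrary illustrative values n = 100 and lambda = 2 (so p = lambda/n = 0.02):

# Binomial probabilities with p = lambda/n versus Poisson probabilities
n <- 100
lambda <- 2
x <- 0:8
round(dbinom(x, n, lambda / n), 4)  # binomial(n, lambda/n)
round(dpois(x, lambda), 4)          # Poisson(lambda); values nearly agree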
Examples
The mean of a binomial $(n,p)$ variable is $\mu = np$ and the variance is $\sigma^2 = np(1-p)$.
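A quick simulation check of these formulas (a sketch; n = 30, p = 0.2 and the number of draws are arbitrary choices):

n <- 30
p <- 0.2
x <- rbinom(100000, n, p)  # 100000 binomial(n, p) outcomes
mean(x)                    # should be close to n*p = 6
var(x)                     # should be close to n*p*(1-p) = 4.8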
The R command dbinom(q, n, p) calculates the probability of q successes in n trials assuming that the probability of a success is p in each trial (binomial distribution), and the R command pbinom(q, n, p) calculates the probability of obtaining q or fewer successes in n trials.

The normal approximation of this distribution can be calculated with pnorm(q, mu, sigma), which becomes pnorm(q, n*p, sqrt(n*p*(1-p))).
Three numerical examples (note that pbinom and pnorm give similar values for large n):
> pbinom(3,10,0.2)
[1] 0.8791261
> pnorm(3,10*0.2,sqrt(10*0.2*(1-0.2)))
[1] 0.7854023
> pbinom(3,20,0.2)
[1] 0.4114489
> pnorm(3,20*0.2,sqrt(20*0.2*(1-0.2)))
[1] 0.2880751
> pbinom(30,200,0.2)
[1] 0.04302156
> pnorm(30,200*0.2,sqrt(200*0.2*(1-0.2)))
[1] 0.03854994
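The agreement for small n can often be improved with a continuity correction, i.e. evaluating the normal c.d.f. at q + 0.5 rather than q. This refinement is not used in the examples above, but a sketch for the first case is:

# Continuity correction for P[X <= 3] with n = 10, p = 0.2
pnorm(3 + 0.5, 10 * 0.2, sqrt(10 * 0.2 * (1 - 0.2)))  # closer to pbinom(3, 10, 0.2)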
We are often interested in computing $t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$, which has a $t$-distribution with $n-1$ degrees of freedom if the $X_i$ are independent outcomes from a normal distribution. If $n$ is large and $\sigma^2$ is finite, this will look as if it comes from a normal distribution.
The numerical examples below demonstrate how the $t$-distribution approaches the normal distribution as the number of degrees of freedom increases.
> qnorm(0.7)
[1] 0.5244005 # This is the value which gives a cumulative probability of p=0.7 for an N(0,1) variable
> qt(0.7,2)
[1] 0.6172134 # The value which gives a cumulative probability of p=0.7 for the t distribution with 2 degrees of freedom
> qt(0.7,5)
[1] 0.5594296
> qt(0.7,10)
[1] 0.541528
> qt(0.7,20)
[1] 0.5328628
> qt(0.7,100)
[1] 0.5260763
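Since qt accepts a vector of degrees of freedom, the same comparison can be written in one call (a compact restatement of the transcript above):

df <- c(2, 5, 10, 20, 100)
qt(0.7, df)  # decreases towards qnorm(0.7) = 0.5244 as df grows
qnorm(0.7)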
Monte Carlo Simulation
If we know the underlying process, we can simulate data from it and evaluate the distribution of any quantity computed from such data.
Figure: A simulated set of $t$-values based on data from an exponential distribution.
Examples
Suppose our measurements come from an exponential distribution and we want to compute

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

but we want to know the distribution of those $t$-values when $\mu$ is the true mean.

For instance, with $n = 5$ and $\mu = 1$ (so the rate is $\lambda = 1$), we can simulate $x_1, \ldots, x_n$ (repeatedly) and compute a $t$-value for each. The following R commands can be used for this:
library(MASS)                     # for truehist
n <- 5
mu <- 1
lambda <- 1
tvec <- NULL
for (sim in 1:10000) {
  x <- rexp(n, lambda)            # simulate n exponential waiting times
  xbar <- mean(x)
  s <- sd(x)
  t <- (xbar - mu) / (s / sqrt(n))
  tvec <- c(tvec, t)              # collect the t-values
}
truehist(tvec)                    # truehist gives a better histogram
Show values at certain positions in the sorted vector by using:
> sort(tvec)[9750]
[1] 1.698656
> sort(tvec)[250]
[1] -6.775726
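These are the empirical 2.5% and 97.5% quantiles of the 10000 simulated t-values. Continuing the same session, they can be compared with the quantiles of the t distribution with n - 1 = 4 degrees of freedom; the pronounced asymmetry reflects the skewness of the exponential distribution:

quantile(tvec, c(0.025, 0.975))  # empirical quantiles from the simulation
qt(c(0.025, 0.975), n - 1)       # -2.776 and 2.776 under normal-theory assumptions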